On cherche à étudier l’effet de trois facteurs sur le transcriptome des racines d’Arabidopsis. On cherche ici à utiliser et comparer différents packages de multiDE sur nos données.

Rappel contexte

Design, contrastes, et test DE Quelle différence avec une seule condition?

EgdeR and DESeq2 Fit a negative binomial generalized log-linear model to the read counts for each gene. Conduct genewise statistical tests for a given coefficient or coefficient contrast. Ce modèle est de la forme

\(log(\mu_{gj}) \hat= \beta_o + \beta X\) knowming that \(X \sim NB(\mu_{gj}, \sigma_g)\)

Generalized linear modelsGeneralized linear models (GLMs)

are an extension of classical linear models to nonnormallydistributed response data. GLMs specify probability distributions according to their mean-variance relationship, for example the quadratic mean-variance relationship specified above forread counts. Assuming that an estimate is available forφg, so the variance can be evaluatedfor any value ofμgi, GLM theory can be used to fit a log-linear modellogμgi=xTiβg+ logNifor each gene [22]. Herexiis a vector of covariates that specifies the treatment conditionsapplied to RNA samplei, andβgis a vector of regression coefficients by which the covariateeffects are mediated for geneg. The quadratic variance function specifies the negativebinomial GLM distributional family. The use of the negative binomial distribution is equivalentto treating theπgias gamma distributed.

Estimating dispersionsFor general experiments (with multiple factors)

edgeR uses the Cox-Reid profile-adjustedlikelihood (CR) method in estimating dispersions [22]. The CR method is derived to overcomethe limitations of the qCML method as mentioned above. It takes care of multiple factors byfitting generalized linear models (GLM) with a design matrix.The CR method is based on the idea of approximate conditional likelihood [?]. Given atable counts or aDGEListobject and the design matrix of the experiment, generalized linearmodels are fitted. This allows valid estimation of the dispersion, since all systematic sourcesof variation are accounted for.

https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

Import des données

            R3   R2   R1   R5  R4   R6  R7  R11  R10  R12  R13  R17  R16  R14
AT1G01010 1526 1006 1116 1275 967 1018 854 1132 1294 1364 2325 2113 2193 2564
AT1G01020  416  285  289  349 364  370 226  513  502  561  461  407  432  614
AT1G01030   31   15   19   29  36   28  12   47   34   47   18   40   32   44
           R15  R18  R24  R19  R22  R23  R21  R20  R9  R8
AT1G01010 2364 2074 1987 2027 1754 1697 1537 1898 816 912
AT1G01020  380  502  484  467  426  415  413  462 223 312
AT1G01030   37   27   42   39   36   40   37   37  15  19
 [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
[1] 28775    24
[1] "cnF_3"
[1] At_AmbientCO2_LowNitrate_Fe1
48 Levels: At_AmbientCO2_HighNitrate_Fe1 ... Sl_ElevatedCO2_LowNitrate_FeStarvation3
[1] "At_AmbientCO2_LowNitrate_Fe"

Analyse globale avec edgeR

TCC fait le test globalement, avec un layout “one way”, qui prend les 8 conditions comme toutes différentes. En spécifiant un design de combinatoire \(2*2*2\), on peut avoir des coefficients testés contre 0 pour chacune des variables et leur interactions.

glmLRT conducts likelihood ratio tests for one or more coefficients in the linear model. If coef is used, the null hypothesis is that all the coefficients indicated by coef are equal to zero. If contrast is non-null, then the null hypothesis is that the specified contrasts of the coefficients are equal to zero. For example, a contrast of c(0,1,-1), assuming there are three coefficients, would test the hypothesis that the second and third coefficients are equal.

While the likelihood ratio test is a more obvious choice for inferences with GLMs, the QLF-test is preferred as it reflects the uncertainty in estimating the dispersion for each gene. Itprovides more robust and reliable error rate control when the number of replicates is small.The QL dispersion estimation and hypothesis testing can be done by using the functionsglmQLFit()andglmQLFTest()

      group lib.size norm.factors
cNF_3   cNF 32033146            1
cNF_2   cNF 25708565            1
cNF_1   cNF 25389227            1
cnF_2   cnF 29410954            1
cnF_1   cnF 26720832            1
cnF_3   cnF 26429803            1
CNF_1   CNF 16944343            1
CnF_2   CnF 28518532            1
CnF_1   CnF 30134923            1
CnF_3   CnF 30939471            1
cNf_1   cNf 30097577            1
cnf_2   cnf 29411503            1
cnf_1   cnf 32417735            1
cNf_2   cNf 34497723            1
cNf_3   cNf 31947553            1
cnf_3   cnf 30167133            1
Cnf_3   Cnf 33780481            1
CNf_1   CNf 30427781            1
Cnf_1   Cnf 29577460            1
Cnf_2   Cnf 30405729            1
CNf_3   CNf 25049585            1
CNf_2   CNf 28496828            1
CNF_3   CNF 18823797            1
CNF_2   CNF 20700940            1
Design matrix not provided. Switch to the classic mode.

Clustering sur les gènes DE

On utilise des modèles de mélange pour regrouper les gènes ayant des profils d’expression similaires dans les différentes conditions.

AT5G46900 AT2G39040 AT4G16370 AT5G46890 AT1G51830 AT3G22910 AT1G03870 AT5G17760 
    25044     18629    150324     23563     35374     20583     89198     69411 
AT4G22460 AT1G05660 AT1G69880 AT1G29100 AT3G60420 AT1G09932 AT5G28510 AT1G65570 
     9066      6423     13183     15636      9823     48109      9953      8104 
AT1G74590 AT4G33550 AT5G53486 AT1G53830 AT1G12080 AT5G24100 AT2G29460 AT1G52130 
    27498      7100      5780     34734    102672     13354     13362      8162 
AT2G43510 AT5G39580 AT1G43910 AT1G56430 AT5G11920 AT1G06120 AT1G51840 AT1G13520 
     9483     74122     36277     34411     61112     10592      8682     32308 
AT4G17030 AT4G21680 AT1G78780 AT3G53540 AT5G24210 AT4G30170 AT3G63380 AT5G13320 
    92957     45645     12538     42649      2049    292312     39178     27057 
AT2G18660 AT3G46400 AT1G53940 AT3G12230 AT5G11210 AT3G53150 AT4G32950 AT1G44130 
     3177      6114     31564      7104      7171     17026     42453      3119 
AT5G06730 AT1G66090 AT4G15290 AT4G33790 AT1G12200 AT4G12480 AT2G14247 AT5G52760 
    36633      5588     17498     12613     66910     79987     25176      5517 
AT1G63750 AT5G13750 AT5G53250 AT2G26560 AT1G26390 AT3G23730 AT1G27140 AT4G32810 
    13249     49596     49109     98987     14295     15400      4996     13727 
AT4G12520 AT5G49690 AT5G44130 AT1G35230 AT1G51420 AT5G02490 AT5G25810 AT3G45700 
    21121      5561     16659      4903     76848     38568     36170      5880 
AT1G48260 AT2G30660 AT3G26610 
    20904     95158      9610 
 [ reached getOption("max.print") -- omitted 4078 entries ]
****************************************
coseq analysis: Poisson approach & none transformation
K = 8 to 12 
Use set.seed() prior to running coseq for reproducible results.
****************************************
Running g = 8 ...
[1] "Initialization: 1"
[1] "Log-like diff: 878.404180648169"
[1] "Log-like diff: 651.349925622305"
[1] "Log-like diff: 805.395049068972"
[1] "Log-like diff: 434.786231420049"
[1] "Log-like diff: 966.648418756102"
Running g = 9 ...
[1] "Initialization: 1"
[1] "Log-like diff: 0.582796121679099"
[1] "Log-like diff: 0.0321640837230284"
[1] "Log-like diff: 0.0137844118717094"
[1] "Log-like diff: 0.00252795050268162"
[1] "Log-like diff: 0.00051506274905222"
Running g = 10 ...
[1] "Initialization: 1"
[1] "Log-like diff: 683.054035391224"
[1] "Log-like diff: 271.024133213778"
[1] "Log-like diff: 581.694897322395"
[1] "Log-like diff: 656.650402574313"
[1] "Log-like diff: 1288.60256562109"
Running g = 11 ...
[1] "Initialization: 1"
[1] "Log-like diff: 908.727295904043"
[1] "Log-like diff: 314.87446940178"
[1] "Log-like diff: 1709.74488523059"
[1] "Log-like diff: 351.701362997625"
[1] "Log-like diff: 1193.43828507614"
Running g = 12 ...
[1] "Initialization: 1"
[1] "Log-like diff: 380.573834136091"
[1] "Log-like diff: 161.079292460031"
[1] "Log-like diff: 140.210806395253"
[1] "Log-like diff: 48.1565928423826"
[1] "Log-like diff: 87.7598594277771"
$logLike


$ICL


$profiles


$boxplots


$probapost_boxplots


$probapost_barplots


$probapost_histogram

*************************************************
Model: Poisson
Transformation: none
*************************************************
Clusters fit: 8,9,10,11,12
Clusters with errors: ---
Selected number of clusters via ICL: 12
ICL of selected model: -4056350
*************************************************
Number of clusters = 12
ICL = -4056350
*************************************************
Cluster sizes:
 Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5  Cluster 6  Cluster 7 
       264        706        407        142        363        271        195 
 Cluster 8  Cluster 9 Cluster 10 Cluster 11 Cluster 12 
       634        603        174         73        321 

Number of observations with MAP > 0.90 (% of total):
3971 (95.62%)

Number of observations with MAP > 0.90 per cluster (% of total per cluster):
 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7
 258       684       378       137       348       258       179      
 (97.73%)  (96.88%)  (92.87%)  (96.48%)  (95.87%)  (95.2%)   (91.79%) 
 Cluster 8 Cluster 9 Cluster 10 Cluster 11 Cluster 12
 615       574       169        69         302       
 (97%)     (95.19%)  (97.13%)   (94.52%)   (94.08%)  

ACP

On représente le clustering dans la plan principal d’une ACP

Class: pca dudi
Call: dudi.pca(df = log(dataC + 0.1), center = TRUE, scale = TRUE, 
    scannf = FALSE, nf = 4)

Total inertia: 24

Eigenvalues:
    Ax1     Ax2     Ax3     Ax4     Ax5 
19.6234  3.1216  0.3283  0.1563  0.1041 

Projected inertia (%):
    Ax1     Ax2     Ax3     Ax4     Ax5 
81.7643 13.0069  1.3679  0.6513  0.4337 

Cumulative projected inertia (%):
    Ax1   Ax1:2   Ax1:3   Ax1:4   Ax1:5 
  81.76   94.77   96.14   96.79   97.22 

(Only 5 dimensions (out of 24) are shown)

Il semblerait que l’ACP détecte dans un premier temps (avec le premier vecteur principal) la valeur moyenne d’expression, puis l’expression diff induite par le fer.

On peut faire des représentations dans le plan des seconds et troisième axes principaux (le premier traduit le fer, le second la carence en nitrate).

Sur le cercle de corrélations dans le plan 2 et 3, on voit bien que lorsque la carence fer est là, on peut différencier un effet nitrate. Le CO2 est, lui encore difficile à identifier, le 4eme axe de l’ACP ne renseignant pas beaucoup…

 

A work by Océane Cassan

oceane.cassan@supagro.fr